| R | Python | Notes |
|---|---|---|
| character | str | |
| complex | complex | includes imaginary numbers |
| numeric | float | |
| integer | int | |
| logical | bool | |

| Function | Description |
|---|---|
| is.datatype | will return TRUE or FALSE |
| as.datatype | will convert from the original datatype to the one specified |
| class(x) | will return the datatype of x |
| is.na | tests for NA (missing) values |
| is.null | tests for NULL values |
In the datatype table above, data can be coerced from the lower end (logical) toward the upper end (character) without loss of precision, but not the other way around.
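A minimal sketch of what that coercion rule looks like in practice, using only base R. The variable names (flag, bad, mixed) are made up for illustration.

```r
# Coercing upward (logical -> integer -> numeric -> character) keeps the value
flag <- TRUE
as.integer(flag)        # 1
as.numeric(2L)          # 2
as.character(3.5)       # "3.5"

# Coercing downward can lose information
as.integer(3.7)         # 3: the fractional part is dropped
bad <- suppressWarnings(as.numeric("abc"))  # text that is not a number
is.na(bad)              # TRUE

# class() and the is.* helpers from the function table above
class(3.5)              # "numeric"
is.character("a")       # TRUE

# c() silently coerces mixed values to the highest type present
mixed <- c(TRUE, 2L, 3.5)
class(mixed)            # "numeric"
```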
| R | Python | Notes |
|---|---|---|
| factor | not available | kinda like Python dictionaries but has levels as well, can be ordered |
| date | datetime | |
| vector | list (all same datatype) | all R primitives are technically vectors and can have length. Values in a vector must be all the same datatype |
| list | list | |
| matrix | not available - Maybe with Pandas? | all data must be of the same type |
| data.frame | not available - Need pandas | Each column can be a different datatype |
There is no is.date function.
The term factor refers to a statistical data type used to store categorical variables. The difference between a categorical variable and a continuous variable is that a categorical variable can belong to a limited number of categories. A continuous variable, on the other hand, can correspond to an infinite number of values.
It is important that R knows whether it is dealing with a continuous or a categorical variable, as the statistical models you will develop in the future treat both types differently.
| R Command | Description |
|---|---|
| x <- 0:10 | Assigns numbers 0 through 10 to x in a vector |
| y <- c(1,2,5,3,7,8,4,9,0) | Assigns the vector to y |
| seq(from, to, by) | generate a sequence<br>indices <- seq(1, 10, 2) # indices is c(1, 3, 5, 7, 9) |
| rep(x, ntimes) | repeat x n times<br>y <- rep(1:3, 2) # y is c(1, 2, 3, 1, 2, 3) |

| Function | Description |
|---|---|
| names(v) <- c('one', 'two', 'three') | Assigns names to the values in the vector |
| names(vector) | Returns the names of all the values in the vector |
| names(v)[3] | returns the name of the value in the third index of the vector |
| v['one'] | returns the name and the value in the index named one |
| v[c("Mon", "Tues", "Wed")] | returns the name and the value in the indices named “Mon”, “Tues”, and “Wed” |
| length(vector) | returns the length of the vector |
| cut(x, n) | divide a continuous variable into a factor with n levels<br>y <- cut(x, 5) |
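The vector helpers in the two tables above can be tried out with a short sketch like the following. The values and names are invented for illustration.

```r
# A small named vector (values and names are made up)
v <- c(10, 20, 30)
names(v) <- c("one", "two", "three")

names(v)               # "one" "two" "three"
names(v)[3]            # "three"
v["one"]               # the element named one, shown with its name
v[c("one", "three")]   # several elements selected by name
length(v)              # 3

# seq() and rep()
indices <- seq(1, 10, 2)   # 1 3 5 7 9
y <- rep(1:3, 2)           # 1 2 3 1 2 3

# cut() bins a continuous vector into a factor with n levels
x <- c(1, 5, 12, 18, 25, 31, 44, 50)
bins <- cut(x, 5)
nlevels(bins)          # 5
table(bins)            # how many values fell into each bin
```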
vector <- c(1, 2, -4, 5, -6)
selection_vector <- vector > 0
selection <- vector[selection_vector]
selection
## [1] 1 2 5
To create a factor, first create a vector with all your values, then
use the factor() function to convert it to a factor. To set
the levels of an ordinal categorical value while you are creating a
factor, use
factor(vector, ordered = TRUE, levels = c("Low", "Medium", "High"))
To set the levels after the factor is already created, use
levels(factor) <- c("name1", "name2",...)
You can also use this to change the names of the levels. Watch out: the order in which you assign the levels matters, because the new names replace the existing levels position by position. Alternatively, you can specify the associations explicitly with a named list of the form new name = old level:
levels(factor) <- list(Female = "F", Male = "M")
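As a rough illustration of the steps above, here is a small base R sketch. The survey-style values and the variable names are hypothetical.

```r
# Hypothetical survey responses used only for illustration
responses <- c("Low", "High", "Medium", "Low", "High")

# Ordered factor: the levels argument fixes the ordering Low < Medium < High
f <- factor(responses, ordered = TRUE, levels = c("Low", "Medium", "High"))
levels(f)
f[1] < f[2]          # TRUE, because Low < High

# Unordered factor; levels default to the sorted unique values "F", "M"
sex <- factor(c("M", "F", "F", "M"))

# Renaming by position: the first new name replaces "F", the second "M"
by_position <- sex
levels(by_position) <- c("Female", "Male")

# Renaming by explicit association (new name = old level), order-independent
by_name <- sex
levels(by_name) <- list(Male = "M", Female = "F")
as.character(by_name)   # "Male" "Female" "Female" "Male"
```

The list form is the safer choice when you are not sure of the current level order.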
Lists can contain anything! (Just like Python)
# Vector with numerics from 1 up to 10
my_vector <- 1:10
# Matrix with numerics from 1 up to 9
my_matrix <- matrix(1:9, ncol = 3)
# First 10 elements of the built-in data frame mtcars
my_df <- mtcars[1:10,]
# names are optional but useful!
my_list <- list(vec = my_vector, mat = my_matrix, df = my_df)
my_list
## $vec
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $mat
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
##
## $df
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Indexing lists in R needs double brackets.
my_list[[2]]
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
Other indexing syntax…
my_list[["vec"]]
## [1] 1 2 3 4 5 6 7 8 9 10
my_list$df
my_list[['df']][2:3,]
# chain select by names
my_list[[c("df", "mpg")]]
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
If you have a list of lists a and want to add a list
b to it, you can use c(a, list(b))
a <- list(1,2,3)
b <- list(4,5,6)
c <- c(a, list(b))
c
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
##
## [[4]]
## [[4]][[1]]
## [1] 4
##
## [[4]][[2]]
## [1] 5
##
## [[4]][[3]]
## [1] 6
If you have a list of lists a and want to add each
element of list b to it, you can use
c(a, b)
d <- c(a, b)
d
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
##
## [[4]]
## [1] 4
##
## [[5]]
## [1] 5
##
## [[6]]
## [1] 6
To select an item in a list inside another list use chained selection…
c[[c(4,2)]]
## [1] 5
x <- list(a = list(d=1,e=10,f=100), b = list(d=2,e=20,f=200), c = list(d=3,e=30,f=300))
x[["a"]]
## $d
## [1] 1
##
## $e
## [1] 10
##
## $f
## [1] 100
`[[`(x, "a")
## $d
## [1] 1
##
## $e
## [1] 10
##
## $f
## [1] 100
lapply(x, `[[`, "f")
## $a
## [1] 100
##
## $b
## [1] 200
##
## $c
## [1] 300
# Dataframes can store vectors of different types
x <- 1:3
y <- 4:6
z <- c('seven', 'eight', 'nine')
df <- data.frame(x, y, z, stringsAsFactors = FALSE)
df
names(df) <- c('one', 'two', 'three')
df
# Matrices must be all of the same type
v <- 7:9
c <- c("one", "two", "three")
mat <- matrix(c(x, y, v), byrow = TRUE, nrow = 3)
mat
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
mat2 <- matrix(c(c, z), byrow = TRUE, nrow = 2)
mat2
## [,1] [,2] [,3]
## [1,] "one" "two" "three"
## [2,] "seven" "eight" "nine"
names(mat) <- c('one', 'two', 'three')
mat
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
## attr(,"names")
## [1] "one" "two" "three" NA NA NA NA NA NA
| Function | Description |
|---|---|
| nrow(dataframe) | returns the number of rows |
| ncol(dataframe) | returns the number of columns |
| str(dataframe) | returns the structure of the dataframe |
| dim(dataframe) | Returns the dimensions of the dataframe |
| head(df, 3) | returns the first 3 rows of the dataframe |
| tail(df, 5) | returns the last 5 rows of the dataframe |
| names(dataframe) | Returns the names of the columns or variables in the dataframe |
| names(df)[3] | returns the third column name |
| names(df) <- c('one', 'two', 'three') | Assigns names to the columns in the dataframe or values in a matrix |
| rownames(matrix_df) <- row_names_vector | Assigns names to the rows in the matrix/dataframe |
| colnames(matrix_df) <- col_names_vector | Assigns names to the columns in the matrix/dataframe |
| rownames(dataframe) | Returns the names of the rows in the dataframe |
| colnames(dataframe) | Returns the names of the columns in the dataframe |
| rownames(df) <- NULL | resets to generic index names |
| colnames(df) <- NULL | resets to generic names |
| rowSums(df) | returns the sum of each row |
| colSums(df) | returns the sum of each column |
| rbind(df, df2) | combines two dataframes or vectors adding the second one as additional rows to the first |
| cbind(df, df2) | combines two dataframes or vectors adding the second one as additional columns to the first |
| dataframe$variable | Returns all of the values in the specified variable as a vector |
| df$totals <- df$var1 + df$var2 | Creates a new column and puts the total of var1 and var2 in that column |
| df[which.max(df$var),] | Finds the row with max in specified variable column |
You can multiply each element in a matrix by the corresponding element
in another matrix using regular operations,
i.e. matrix1 * matrix2
This is not standard matrix multiplication, for which you should
use %*% in R.
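A small sketch of the difference, using two made-up 2 x 2 matrices:

```r
m1 <- matrix(1:4, nrow = 2)   # filled by column: rows are (1, 3) and (2, 4)
m2 <- matrix(5:8, nrow = 2)   # rows are (5, 7) and (6, 8)

elementwise <- m1 * m2        # multiplies matching positions
product     <- m1 %*% m2      # row-by-column matrix product

elementwise
##      [,1] [,2]
## [1,]    5   21
## [2,]   12   32
product
##      [,1] [,2]
## [1,]   23   31
## [2,]   34   46
```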
library(data.table, quietly = TRUE)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
x <- 1:3
y <- 4:6
z <- c('seven', 'eight', 'nine')
DT <- data.table(x, y, z) # strings are automatically characters not factors
DT
Accessing data is a little different than with data.frame.
| Function | Description |
|---|---|
| DT[1:5, ] | rows 1 - 5 |
| DT[A>=7, ] | All rows where column A >= 7 |
| DT[ , B] | only column B |
| DT[ , list(B, D)] | only column B and D |
x <- 5 or 5 -> x
a <- b <- 36
assign('y', 42)
Variable names can use any combination of alphanumeric characters, periods and underscores, but they cannot start with a number or underscore.
Note - single and double quotes can be used interchangeably like in Python.
remove(var) or rm(var)
devtools::install_github("seankross/lego")
library(lego)
data(legosets)
table(legosets$Availability, useNA='ifany')
##
## LEGO exclusive LEGOLAND exclusive Not specified
## 695 2 1795
## Promotional Promotional (Airline) Retail
## 141 12 3120
## Retail - limited Unknown
## 403 4
table(legosets$Availability, legosets$Packaging, useNA='ifany')
##
## Blister pack Box Box with backing card Bucket Canister
## LEGO exclusive 45 147 0 1 0
## LEGOLAND exclusive 0 2 0 0 0
## Not specified 0 20 0 0 0
## Promotional 0 44 0 0 0
## Promotional (Airline) 0 11 0 0 0
## Retail 53 2575 16 30 78
## Retail - limited 2 302 1 5 0
## Unknown 0 1 0 0 0
##
## Foil pack Loose Parts Not specified Other Plastic box
## LEGO exclusive 0 71 7 5 1
## LEGOLAND exclusive 0 0 0 0 0
## Not specified 5 0 1739 0 6
## Promotional 0 1 0 3 2
## Promotional (Airline) 0 0 1 0 0
## Retail 285 0 0 28 0
## Retail - limited 1 0 0 0 1
## Unknown 0 0 0 0 0
##
## Polybag Shrink-wrapped Tag Tub
## LEGO exclusive 412 0 6 0
## LEGOLAND exclusive 0 0 0 0
## Not specified 24 0 0 1
## Promotional 90 0 0 1
## Promotional (Airline) 0 0 0 0
## Retail 4 18 0 33
## Retail - limited 86 0 0 5
## Unknown 3 0 0 0
prop.table(table(legosets$Availability))
##
## LEGO exclusive LEGOLAND exclusive Not specified
## 0.1126053143 0.0003240441 0.2908295528
## Promotional Promotional (Airline) Retail
## 0.0228451069 0.0019442644 0.5055087492
## Retail - limited Unknown
## 0.0652948801 0.0006480881
Good for Categorical Variables
Regular Plot
barplot(table(legosets$Availability), las=3)
Proportional Plot
barplot(prop.table(table(legosets$Availability)), las=3)
# Plot cumulative outcome of 1000 coin tosses
coins <- sample(c(-1,1), 1000, replace=TRUE)
plot(1:length(coins), cumsum(coins), type='l')
abline(h=0)
# same exact plot but change the y axis to show the total range of possibilities
plot(1:length(coins), cumsum(coins), type='l', ylim=c(-1000, 1000))
abline(h=0)
# Plot cumulative outcome of 100 coin tosses
coins <- sample(c(-1,1), 100, replace=TRUE)
plot(1:length(coins), cumsum(coins), type='l')
abline(h=0)
# Value at the end
cumsum(coins)[length(coins)]
## [1] -16
# Loop to do the same as above 1000 times and record the ending value of each 100 coin tosses.
samples <- rep(NA, 1000)
for(i in seq_along(samples)) {
coins <- sample(c(-1,1), 100, replace=TRUE)
samples[i] <- cumsum(coins)[length(coins)]
}
head(samples, 30)
## [1] -2 -12 10 -12 16 6 -12 -20 22 4 6 8 -4 -2 0 -8 -12 2 6
## [20] 14 -8 -6 -4 24 -12 -14 -2 2 0 -4
mean(samples)
## [1] -0.27
For two or three Categorical Variables
library(vcd)
mosaic(HairEyeColor, shade=TRUE, legend=TRUE)
For quantitative variables
stripchart(legosets$Pieces)
For quantitative variable grouped by a categorical variable
par.orig <- par(mar=c(1,10,1,1))
stripchart(legosets$Pieces ~ legosets$Availability, las=1)
par(par.orig)
hist(legosets$Pieces)
With highly skewed distributions, it is often helpful to transform the data. The log transformation is a common approach, especially when dealing with salary or similar data.
hist(log(legosets$Pieces))
The histogram looks normal, but we can overlay a normal curve with the sample mean and standard deviation to help evaluate it.
h <- hist(heights, xlim=c(60, 80))
x <- seq(min(heights)-5, max(heights)+5, 0.01)
y <- dnorm(x, mean(heights), sd(heights))
y <- y * diff(h$mids[1:2]) * length(heights)
lines(x, y, lwd=1.5, col='blue')
qqnorm(heights, cex=0.5, main='', axes=F, ylab='Male heights (in)', pch=19)
axis(1)
axis(2)
abline(mean(heights), sd(heights), col="blue", lwd=1.5)
qqnorm(samples)
DATA606::qqnormsim(samples)
normal_plot(mean = 0, sd = 1, cv = c(-1, 1))
plot(density(legosets$Pieces, na.rm=TRUE), main='Lego Pieces per Set')
plot(density(log(legosets$Pieces), na.rm=TRUE), main='Lego Pieces per Set (log transformed)')
For quantitative variables
scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)
boxplot(scores, horizontal = TRUE)
boxplot(legosets$Pieces)
boxplot(log(legosets$Pieces))
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 1 is not drawn
plot(legosets$Pieces, legosets$USD_MSRP)
legosets[which(legosets$USD_MSRP >= 400),]
legosets[which(legosets$Pieces >= 4000),]
plot(legosets$Pieces, legosets$USD_MSRP)
bigAndExpensive <- legosets[which(legosets$Pieces >= 4000 | legosets$USD_MSRP >= 400),]
text(bigAndExpensive$Pieces, bigAndExpensive$USD_MSRP, labels=bigAndExpensive$Name)
# If statement alone
if (condition) {
  # code to run when condition is TRUE
}
# If Else statement
if (condition) {
  # code to run when condition is TRUE
} else {
  # code to run when condition is FALSE
}
# If, Else If, Else statement
if (condition1) {
  # code to run when condition1 is TRUE
} else if (condition2) {
  # code to run when condition1 is FALSE but condition2 is TRUE
} else {
  # code to run when both conditions are FALSE
}
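A concrete, runnable version of the if / else if / else pattern might look like this. The describe_number function and its thresholds are hypothetical, chosen only to exercise all three branches.

```r
# Hypothetical helper that labels a number
describe_number <- function(x) {
  if (x < 0) {
    "negative"
  } else if (x == 0) {
    "zero"
  } else {
    "positive"
  }
}

describe_number(-3)   # "negative"
describe_number(0)    # "zero"
describe_number(12)   # "positive"
```

Note that else must sit on the same line as the closing brace of the previous branch, otherwise R treats the if statement as finished.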
while (condition) {
expr
increment
}
Can nest if statements inside
while (condition) {
if (other_condition) {
expr
}
increment
}
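For instance, a while loop with a nested if could be sketched as follows (the task of summing even numbers is just an illustration):

```r
# Sum the even numbers from 1 to 10 with a while loop and a nested if
i <- 1
total <- 0
while (i <= 10) {
  if (i %% 2 == 0) {   # only act on even values
    total <- total + i
  }
  i <- i + 1           # the increment; without it the loop never ends
}
total                  # 30
```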
vec <- c(2, 3, 5, 7, 11, 13)
# Option 1
for (el in vec) {
print(el)
}
## [1] 2
## [1] 3
## [1] 5
## [1] 7
## [1] 11
## [1] 13
# Option 2
for (i in 1:length(vec)) {
print(vec[i])
}
## [1] 2
## [1] 3
## [1] 5
## [1] 7
## [1] 11
## [1] 13
# To access and change the elements themselves, you need to use the index-based Option 2 approach above!
vec2 <- as.data.frame(vec)
for (i in 1:length(vec2)) {
vec2$date <- Sys.Date()
}
vec2
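The difference between the two loop styles is easier to see on a plain vector. This is a sketch with the same vec values as above; the doubling step is arbitrary.

```r
vec <- c(2, 3, 5, 7, 11, 13)

# Option 1 style: el is a copy, so changing it leaves vec untouched
for (el in vec) {
  el <- el * 2
}
unchanged <- vec      # still 2 3 5 7 11 13

# Option 2 style: assigning through the index updates vec itself
for (i in seq_along(vec)) {
  vec[i] <- vec[i] * 2
}
vec                   # 4 6 10 14 22 26
```

seq_along(vec) is used instead of 1:length(vec) because it behaves sensibly when the vector is empty.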
for (var in seq) {
expr
}
for (var in seq) {
  if (condition) {
    next # skips the rest of this iteration when condition is TRUE
  }
  expr
}
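A short runnable sketch of next in action (collecting odd numbers is just an example task):

```r
# Collect only the odd numbers from 1 to 6, skipping the even ones with next
odds <- c()
for (n in 1:6) {
  if (n %% 2 == 0) {
    next             # jump straight to the next iteration
  }
  odds <- c(odds, n)
}
odds                 # 1 3 5
```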
my_func <- function(arg1, arg2=DEFAULT){
code
}
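Filling in that template with something concrete, a function with a default argument might look like this. The name power and its behaviour are invented for illustration.

```r
# Hypothetical function: raise x to a power, squaring by default
power <- function(x, exponent = 2) {
  x ^ exponent        # the last evaluated expression is the return value
}

power(3)                     # 9, uses the default exponent
power(3, 3)                  # 27, positional argument overrides the default
power(exponent = 4, x = 2)   # 16, named arguments can come in any order
```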
Use to iterate without a for loop
# X can be a list or a vector, FUN is the function to apply to each item
lapply(X, FUN)
# Always returns a list.
# If you don't want a list, do this...
unlist(lapply(X, FUN))
# Returns a vector
# To use lapply with a function that takes more than one argument
lapply(X, FUN, arg)
# only use if all items are of the same type.
# returns a named vector (unlists automatically)
sapply(X, FUN, USE.NAMES=FALSE)
# USE.NAMES arg to get an unnamed vector
If each item returns a vector of the same length (greater than 1), sapply returns a matrix
If the items return vectors of different lengths, sapply returns a list
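The return shapes described above can be checked with a small base R sketch. The input vector words is made up.

```r
words <- c("apple", "fig", "banana")

as_list   <- lapply(words, nchar)            # a list of 3 integers
as_vector <- unlist(lapply(words, nchar))    # 5 3 6
named     <- sapply(words, nchar)            # names are the words themselves
unnamed   <- sapply(words, nchar, USE.NAMES = FALSE)

# Extra arguments after the function are passed through to it
first_two <- sapply(words, substr, 1, 2, USE.NAMES = FALSE)  # "ap" "fi" "ba"

# Same-length results (length > 1) are simplified to a matrix
ranges <- sapply(list(a = 1:3, b = 4:9), range)
dim(ranges)                                  # 2 2

# Different-length results stay a list
pieces <- sapply(c(2, 3), seq_len)
class(pieces)                                # "list"
```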
This section taken from https://www.statmethods.net/management/functions.html with lots of additions by me.
Tip: If you use the up and down arrow keys, you can scroll through your previous commands, your so-called command history. You can also access it by clicking on the history tab in the upper right panel. This will save you a lot of typing in the future.
| R Command | Description |
|---|---|
| ctrl + L | to clear console |
| ls() | list the objects in memory to the console |
| library() | Lists the packages in your library |
| search() | Shows packages that are currently active (when you load a package you are adding it to your search list) |
| install.packages("package_name") | installs package |
| library(package_name) or require(package_name) | loads package into memory |
| ?function_name | Displays the documentation for the function in the viewer window in RStudio |
| apropos('func') | to search for a function by only part of the name |
| args(func) | To get information about the function arguments |
| getwd() | get the working directory |
| setwd("file/path") | set the working directory |
| vignette(package="package_name") | To see a list of ‘vignettes’ or sample code for a package |
| vignette("vin_name", package="pack_name") | to view a specific vignette |
| data(package="package_name")$results | to see the datasets that come with a package |
| data(dataset_name) | to load data - note it does not show up in the “Data” section of your environment in RStudio until you use the data in another function like head(data) or dim(data) |
options(repos = c(CRAN = "http://cran.rstudio.com"))
Note: Putting parentheses around an assignment prints its value, which is equivalent to calling the print function on the result
| Function | Description |
|---|---|
| +, -, *, / | addition, subtraction, multiplication, division |
| x %% y | modulo or remainder of division of x by y |
| x %/% y | integer division - number of times y goes into x without remainder |
| x ^ y (or x ** y) | exponentiation - x raised to the power y |
| abs(x) | absolute value |
| sqrt(x) | square root |
| ceiling(x) | ceiling(3.475) is 4 |
| floor(x) | floor(3.475) is 3 |
| trunc(x) | trunc(5.99) is 5 |
| round(x, digits=n) | round(3.475, digits=2) is 3.48 |
| signif(x, digits=n) | signif(3.475, digits=2) is 3.5 |
| cos(x), sin(x), tan(x) | also acos(x), cosh(x), acosh(x), etc. |
| log(x) | natural logarithm |
| log10(x) | common logarithm |
| exp(x) | e^x |
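A few of the less obvious operators from the table, with arbitrary example numbers:

```r
17 %% 5                      # 2, the remainder
17 %/% 5                     # 3, how many whole times 5 goes into 17
2 ^ 10                       # 1024
ceiling(3.475)               # 4
floor(3.475)                 # 3
trunc(-5.99)                 # -5, truncates toward zero (floor(-5.99) is -6)
signif(123456, digits = 2)   # 120000
log(exp(1))                  # 1
sqrt(abs(-16))               # 4
```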
| Operator | Description |
|---|---|
| < | less than |
| <= | less than or equal to |
| > | greater than |
| >= | greater than or equal to |
| == | exactly equal to |
| != | not equal to |
| !x | Not x |
| x \| y | x OR y |
| x & y | x AND y |
| x %in% c(a, b, c) | TRUE if x is in the vector c(a, b, c) |
| isTRUE(x) | test if x is TRUE |
| any(v1 < v2) | checks if any item in a vector is less than the corresponding item in a second vector |
| all(v1 < v2) | checks if all items in a vector are less than the corresponding items in a second vector |
| identical(x, y) | checks if the two items are identical |
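These operators work element by element on vectors, which the following sketch illustrates with two made-up vectors:

```r
x <- c(1, 5, 9)
y <- c(2, 4, 10)

x < y                    # TRUE FALSE TRUE, compared element by element
!(x < y)                 # FALSE TRUE FALSE
x > 4 & y > 4            # FALSE FALSE TRUE
x > 4 | y > 4            # FALSE TRUE TRUE
any(x < y)               # TRUE
all(x < y)               # FALSE
5 %in% x                 # TRUE
identical(x, c(1, 5, 9)) # TRUE
isTRUE(x < y)            # FALSE: isTRUE() is only TRUE for a single TRUE value
```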
| Function | Description |
|---|---|
| nchar(x) | returns the number of characters in x (works on character and numeric datatypes even within vectors, will not work on factors) |
| toupper(x) | Uppercase |
| tolower(x) | Lowercase |
| substr(x, start=n1, stop=n2) | Extract or replace substrings in a character vector.<br>x <- "abcdef"<br>substr(x, 2, 4) is "bcd"<br>substr(x, 2, 4) <- "22222" is "a222ef" |
| grep(pattern, x, ignore.case=FALSE, fixed=FALSE) | Search for pattern in x. If fixed=FALSE then pattern is a regular expression. If fixed=TRUE then pattern is a text string. Returns matching indices.<br>grep("A", c("b","A","c"), fixed=TRUE) returns 2 |
| sub(pattern, replacement, x, ignore.case=FALSE, fixed=FALSE) | Find pattern in x and replace with replacement text. If fixed=FALSE then pattern is a regular expression. If fixed=TRUE then pattern is a text string.<br>sub("\\s",".","Hello There") returns "Hello.There" |
| gsub(pattern, replacement, x) | Same as sub but replaces all matches, not just the first occurrence, in each item of your vector |
| strsplit(x, split) | Split the elements of character vector x at split.<br>strsplit("abc", "") returns a list holding the 3 element vector "a","b","c" |
| paste(..., sep="") | Concatenate strings, using the sep string to separate them.<br>paste("x",1:3,sep="") returns c("x1","x2","x3")<br>paste("x",1:3,sep="M") returns c("xM1","xM2","xM3")<br>paste("Today is", date())<br>paste(Year, Month, DayofMonth, sep="-") |
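The string helpers above can be exercised in one short sketch. The input strings are arbitrary.

```r
x <- "abcdef"
substr(x, 2, 4)                             # "bcd"
substr(x, 2, 4) <- "22222"                  # only 3 characters are replaced
x                                           # "a222ef"

grep("A", c("b", "A", "c"), fixed = TRUE)   # 2
sub("\\s", ".", "Hello There World")        # "Hello.There World"
gsub("\\s", ".", "Hello There World")       # "Hello.There.World"

strsplit("abc", "")[[1]]                    # "a" "b" "c"
paste("x", 1:3, sep = "")                   # "x1" "x2" "x3"
toupper("shout")                            # "SHOUT"
nchar(c("one", "three"))                    # 3 5
```

The backslash in a regular expression has to be doubled inside an R string, which is why the pattern is written "\\s".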
Basic statistical functions are provided in the following table. Each has the option na.rm to strip missing values before calculations. Otherwise the presence of missing values will lead to a missing result. Object can be a numeric vector or data frame.
| Function | Description |
|---|---|
| min(x) | minimum |
| max(x) | maximum |
| sum(x) | sum |
| cumsum(x) | running total (cumulative sum) |
| diff(x) | difference |
| range(x) | range |
| mean(x, trim=0,<br>na.rm=FALSE) | mean of object x<br># trimmed mean, removing any missing values and<br># 5 percent of highest and lowest scores<br>mx <- mean(x,trim=.05,na.rm=TRUE) |
| median(x) | median |
| var(x) | variance |
| sd(x) | standard deviation of object(x). also look at var(x) for variance and mad(x) for median absolute deviation. |
| summary(x) | Returns Min, Max, 1st Qtr, 3rd Qtr, Median, Mean and num of missing values - no SD. Can be used with factors, but not categorical vectors |
| quantile(x, probs) | quantiles where x is the numeric vector whose quantiles are desired and probs is a numeric vector with probabilities in [0,1].<br># 30th and 84th percentiles of x<br>y <- quantile(x, c(.3,.84)) |
| fivenum(x) | min, 1st, 2nd, 3rd Quartiles, and Max |
| IQR(x) | Spread between 25th and 75th percentile |
| rank(x) | takes a group of values and calculates the rank of each value within the group |
| diff(range(x)) | total range of vector x |
| diff(x, lag=1) | lagged differences, with lag indicating which lag to use |
| scale(x, center=TRUE, scale=TRUE) | column center or standardize a matrix. |
NOTE: adding na.rm=TRUE will ignore
missing values in most functions above
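The effect of na.rm is easiest to see on a vector with a missing value. The scores below are made up for illustration.

```r
scores <- c(57, 66, 69, NA, 72, 73, 94)

mean(scores)                         # NA, because one value is missing
mean(scores, na.rm = TRUE)           # mean of the 6 observed values
median(scores, na.rm = TRUE)         # 70.5
range(scores, na.rm = TRUE)          # 57 94
diff(range(scores, na.rm = TRUE))    # 37
cumsum(c(1, 2, 3, 4))                # 1 3 6 10
diff(c(1, 4, 9, 16))                 # 3 5 7
quantile(scores, c(.3, .84), na.rm = TRUE)
```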
psych Package
library(psych)
describe(legosets$Pieces, skew=FALSE)
describeBy(legosets$Pieces, group = legosets$Availability, skew=FALSE, mat=TRUE)
The following table describes functions related to probability distributions. For random number generators below, you can use set.seed(1234) or some other integer to create reproducible pseudo-random numbers.
| Function | Description |
|---|---|
| dt(x, df) | |
| pt(q, df) | # Example 1: Find the area to the left of a t-statistic with value of -0.785 and 14 degrees of freedom.<br>pt(-0.785, 14)<br># Example 2: Find the area to the right of a t-statistic with value of -0.785 and 14 degrees of freedom.<br># the following approaches produce equivalent results<br># 1 minus area to the left<br>1 - pt(-0.785, 14)<br># area to the right<br>pt(-0.785, 14, lower.tail = FALSE)<br>pt(t-score) = probability (p-value) |
| qt(p, df) | # find the t-score of the 99th quantile of the Student t distribution with df = 20<br>qt(.99, df = 20)<br># find the t-score of the 95th quantile of the Student t distribution with df = 20<br>qt(.95, df = 20)<br>qt(p-value) = t-test_statistic (t-score) |
| rt(n, df) | |
| dnorm(x) | normal density function (by default m=0 sd=1)<br># plot standard normal curve<br>x <- pretty(c(-3,3), 30)<br>y <- dnorm(x)<br>plot(x, y, type='l', xlab="Normal Deviate", ylab="Density", yaxs="i") |
| pnorm(q) | cumulative normal probability for q (area under the normal curve to the left of q)<br>pnorm(1.96) is 0.975<br>pnorm(z-score) = probability (p-value) |
| qnorm(p) | normal quantile. value at the p percentile of normal distribution<br>qnorm(.9) is 1.28 # 90th percentile<br>qnorm(p-value) = z-score |
| rnorm(n, m=0, sd=1) | n random normal deviates with mean m and standard deviation sd.<br># 50 random normal variates with mean=50, sd=10<br>x <- rnorm(50, m=50, sd=10) |
| dbinom(1, 4, 0.35) | The probability of getting exactly one success in 4 trials with 0.35 probability of success |
| choose(4,1) | The number of ways to get 1 success in 4 trials - computes the combination \(_4C_1\) |
| dbinom(x, size, prob)<br>pbinom(q, size, prob)<br>qbinom(p, size, prob)<br>rbinom(n, size, prob) | binomial distribution where size is the sample size and prob is the probability of a heads<br># prob of 0 to 5 heads of fair coin out of 10 flips<br>dbinom(0:5, 10, .5)<br># prob of 5 or fewer heads of fair coin out of 10 flips<br>pbinom(5, 10, .5) |
| dpois(x, lambda)<br>ppois(q, lambda)<br>qpois(p, lambda)<br>rpois(n, lambda) | poisson distribution with mean=variance=lambda<br># probability of 0, 1, or 2 events with lambda=4<br>dpois(0:2, 4)<br># probability of at least 3 events with lambda=4<br>1 - ppois(2,4) |
| dunif(x, min=0, max=1)<br>punif(q, min=0, max=1)<br>qunif(p, min=0, max=1)<br>runif(n, min=0, max=1) | uniform distribution, follows the same pattern as the normal distribution above.<br># 10 uniform random variates<br>x <- runif(10) |
Note that while the examples on this page apply functions to individual variables, many can be applied to vectors and matrices as well.
A combination does not take into account the order, whereas a permutation does. Using the example from mathsisfun.com:
Great explanation of combinations and permutations including how to calculate in R
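As a rough sketch of how the two counts can be computed in base R (this is not the example from the linked page; n and k are arbitrary illustration values):

```r
n <- 5   # items to choose from
k <- 3   # items chosen

combinations <- choose(n, k)                      # order ignored: 10
permutations <- factorial(n) / factorial(n - k)   # order matters: 60

# Each combination of k items can be arranged in factorial(k) orders
permutations == combinations * factorial(k)       # TRUE

# combn() lists the combinations themselves, one per column
ncol(combn(n, k))                                 # 10
```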
Really good explanation at seankross.com/notes/dpqr/
https://www.unc.edu/courses/2008fall/ecol/563/001/images/lectures/lecture3/lecture3.htm#probfunc
Fig. 3 The four probability functions for the normal distribution
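There are four basic probability functions for each probability distribution in R. Each of R’s probability functions begins with one of four prefixes: d, p, q, or r, followed by a root name that identifies the probability distribution. For the normal distribution the root name is “norm”. The meaning of these prefixes is as follows: d is the probability density (or mass) function, p is the cumulative distribution function, q is the quantile function (the inverse of p), and r generates random draws from the distribution.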
To better understand what these functions do we’ll focus on the four probability functions for the normal distribution: dnorm, pnorm, qnorm, and rnorm. Fig. 3 illustrates the defining relationships among these four functions.
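A minimal sketch of how the four functions relate to each other, using the standard normal distribution. The seed and the sample size are arbitrary.

```r
dnorm(0)              # height of the standard normal density at 0, about 0.399
pnorm(1.96)           # area to the left of 1.96, about 0.975
qnorm(0.975)          # value with 97.5% of the area to its left, about 1.96
qnorm(pnorm(1.5))     # q undoes p, giving back 1.5

set.seed(1234)        # makes the random draws reproducible
draws <- rnorm(1000, mean = 50, sd = 10)
length(draws)         # 1000

# The same prefixes work for other roots, e.g. binom
dbinom(1, 4, 0.35)               # P(exactly 1 success in 4 trials)
choose(4, 1) * 0.35^1 * 0.65^3   # same value from the binomial formula
```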
| Function | Description |
|---|---|
| table(df$var1, useNA='ifany') | Creates a table of counts of each value for the variable |
| table(df$var1, df$var2, useNA='ifany') | Creates a table of counts for the intersection of the two variables |
| prop.table(table(df$var)) | Creates a table of the proportion of each value for the variable |
| barplot(table(df$var), las=3) | Creates a bargraph of the values of the variable |
| plot(x = df$var1, y = df$var2) | Creates a scatterplot of the two variables. Technically you don’t need the x= and y= as long as you put them first and in that order because by default the first 2 arguments are for the x and y variables |
| plot(df$var1, df$var2, type = "l") | Creates a linegraph of the two variables |
| hist(x) | Creates a histogram of the single variable x |
read.csv("file/location", header = F)
Must use double backslash or forward slashes.
header = F means the original file has no header
Use Foreign Package
install.packages("foreign")
library(foreign)
df <- read.spss("file/location", to.data.frame=T, use.value.labels=T)
Here is some sample code for reading a dataset into R from a GitHub repository:
library(RCurl)
x <- getURL("https://raw.github.com/aronlindberg/latent_growth_classes/master/LGC_data.csv")
y <- read.csv(text = x)
source: http://stackoverflow.com/questions/14441729/read-a-csv-from-github-into-r
Make sure you copy the RAW data URL location.
For uniformly distributed (flat) random numbers, use runif(). By default, its range is from 0 to 1.
# Generate a random number from 0 to 1
runif(1)
#> [1] 0.09006613
# Get a vector of 4 random numbers from 0 to 1
runif(4)
#> [1] 0.6972299 0.9505426 0.8297167 0.9779939
# Get a vector of 3 numbers from 0 to 100
runif(3, min=0, max=100)
#> [1] 83.702278 3.062253 5.388360
# Get 3 integers from 0 to 100
# Use max=101 because it will never actually equal 101
floor(runif(3, min=0, max=101))
#> [1] 11 67 1
# sample() does nearly the same thing, but 1:100 never returns 0 (use 0:100 to include it)
sample(1:100, 3, replace=TRUE)
#> [1] 8 63 64
# To generate integers WITHOUT replacement:
sample(1:100, 3, replace=FALSE)
#> [1] 76 25 52
To generate numbers from a normal distribution, use rnorm(). By default the mean is 0 and the standard deviation is 1.
rnorm(4)
#> [1] -2.3308287 -0.9073857 -0.7638332 -0.2193786
# Use a different mean and standard deviation
rnorm(4, mean=50, sd=10)
#> [1] 59.20927 40.12440 44.58840 41.97056
# To check that the distribution looks right, make a histogram of the numbers
x <- rnorm(400, mean=50, sd=10)
hist(x)
If you want to generate a sequence of random numbers, and then generate that same sequence again later, use set.seed(), and pass in a number as the seed.
set.seed(423)
runif(3)
#> [1] 0.1089715 0.5973455 0.9726307
set.seed(423)
runif(3)
#> [1] 0.1089715 0.5973455 0.9726307
Use the sample( ) function to take a random sample of size n from a dataset.
# take a random sample of size 50 from a dataset mydata
# sample without replacement
mysample <- mydata[sample(1:nrow(mydata), 50,
replace=FALSE),]
The system.time() function will measure how long it takes to run a particular block of code in R.
system.time({
# Do something that takes time
x <- 1:100000
for (i in seq_along(x)) x[i] <- x[i]+1
})
#> user system elapsed
#> 0.144 0.002 0.153
The output means it took 0.153 seconds to run the block of code.
R has powerful indexing features for accessing object elements. These features can be used to select and exclude variables and observations. The following code snippets demonstrate ways to keep or delete variables and observations and to take random samples from a dataset.
# select variables v1, v2, v3
myvars <- c("v1", "v2", "v3")
newdata <- mydata[myvars]
# another method same as above
myvars <- paste("v", 1:3, sep="")
newdata <- mydata[myvars]
# select 1st and 5th thru 10th variables
newdata <- mydata[c(1,5:10)]
# exclude variables v1, v2, v3
myvars <- names(mydata) %in% c("v1", "v2", "v3")
newdata <- mydata[!myvars]
# exclude 3rd and 5th variable
newdata <- mydata[c(-3,-5)]
# delete variables v3 and v5
mydata$v3 <- mydata$v5 <- NULL
# first 5 observations
newdata <- mydata[1:5, ]
# first 5 variables/columns
newdata <- mydata[ ,1:5]
# row 2 column 6
newdata <- mydata[2,6]
# row 2 and 4 column 6
newdata <- mydata[c(2,4),6]
# based on variable values
# which same as where clause in SQL
newdata <- mydata[ which(mydata$gender=='F'
& mydata$age > 65), ]
# or
attach(mydata)
newdata <- mydata[ which(gender=='F' & age > 65),]
detach(mydata)
# with allows us to specify the columns of a data.frame without having to specify the data.frame name each time...
baseball$OBP <- with(baseball, (h + bb + hbp) / (ab + bb + hbp + sf))
The subset( ) function is the easiest way to select variables and observations. In the following example, we select all rows that have a value of age greater than or equal to 20 or age less than 10. We keep the ID and Weight columns.
# using subset function
newdata <- subset(mydata, age >= 20 | age < 10,
select=c(ID, Weight))
In the next example, we select all men over the age of 25 and we keep variables weight through income (weight, income and all columns between them).
# using subset function (part 2)
newdata <- subset(mydata, sex=="m" & age > 25,
select=weight:income)
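Because mydata is only a placeholder, here is a runnable sketch of the same two approaches on the built-in mtcars data frame. The thresholds and the chosen columns are arbitrary.

```r
# Row and column selection with which() and a vector of column names
by_index <- mtcars[which(mtcars$cyl == 4 & mtcars$mpg > 30), c("mpg", "cyl", "hp")]

# The same selection with subset(); column names need no quotes here
by_subset <- subset(mtcars, cyl == 4 & mpg > 30, select = c(mpg, cyl, hp))

all.equal(by_index, by_subset)   # the two approaches pick the same rows and columns

# A range of adjacent columns, in the style of select=weight:income above
adjacent <- subset(mtcars, mpg > 30, select = mpg:disp)
names(adjacent)                  # "mpg" "cyl" "disp"
```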
Note: Column names do not need quotes, but values do.
aggregate(var ~ group_var, data, FUN)
# to group by more than one variable, separate with a + sign
aggregate(price ~ cut + color, diamonds, mean, na.rm=TRUE)
# to aggregate more than one variable, use cbind()
aggregate(cbind(price, carat) ~ cut, diamonds, mean, na.rm=TRUE)
Keep in mind that plyr, dplyr, and data.table are faster.
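The diamonds data set needs ggplot2, so the sketch below repeats the same three patterns on the built-in mtcars data instead. The choice of columns is arbitrary.

```r
# Mean mpg for each number of cylinders
by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Group by two variables with +
by_cyl_am <- aggregate(mpg ~ cyl + am, data = mtcars, FUN = mean)

# Aggregate two variables at once with cbind()
two_vars <- aggregate(cbind(mpg, hp) ~ cyl, data = mtcars, FUN = mean)

by_cyl
names(two_vars)      # "cyl" "mpg" "hp"
```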
# returns a vector of index positions
order(data$var, decreasing=TRUE)
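Since order() returns positions rather than values, it is normally used as an index. A short sketch with made-up scores and the built-in mtcars data:

```r
scores <- c(b = 72, a = 94, c = 57)        # made-up values

idx <- order(scores, decreasing = TRUE)    # 2 1 3: positions from largest to smallest
scores[idx]                                # the vector itself, sorted

# Sorting a data frame by a column
sorted <- mtcars[order(mtcars$mpg, decreasing = TRUE), ]
rownames(sorted)[1]                        # "Toyota Corolla", the highest mpg
```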
Dplyr does not change the original dataset.
Convert a dataframe into a tbl (tibble)
df <- tbl_df(df)
See structure but better than str()!
glimpse(df)
Use a lookup table to convert codes to values
# The lookup table
lut <- c("A" = "carrier", "B" = "weather", "C" = "FFA", "D" = "security", "E" = "not cancelled")
# Add the Code column
hflights$Code <- lut[hflights$CancellationCode]
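The lookup works because indexing a named vector by a character vector returns the matching values. A self-contained sketch with a shortened table and a hypothetical codes vector standing in for hflights$CancellationCode:

```r
# The lookup table: names are the codes, values are the labels
lut <- c("A" = "carrier", "B" = "weather", "D" = "security")

codes <- c("B", "A", "A", "D")     # hypothetical coded column
labels <- lut[codes]               # look up each code by name
unname(labels)                     # "weather" "carrier" "carrier" "security"

# A code missing from the table comes back as NA
unname(lut[c("A", "Z")])           # "carrier" NA
```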
Returns a subset of the columns. Variable names do not need quotes.
select(df, col1, col2, col3, ...)
dplyr comes with a set of helper functions that can help you select
groups of variables inside a select() call:
| Function | Description |
|---|---|
| starts_with("X") | every name that starts with “X”, |
| ends_with("X") | every name that ends with “X”, |
| contains("X") | every name that contains “X”, |
| matches("X") | every name that matches “X”, where “X” can be a regular expression, |
| num_range("x", 1:5) | the variables named x1, x2, x3, x4 and x5, |
| one_of(x) | every name that appears in x, which should be a character vector. |
Pay attention here: When you refer to columns directly inside select(), you don’t use quotes. If you use the helper functions, you do use quotes.
Returns a subset of the rows.
# keeps only the observations for which col1 is equal to 1 and col2 is not equal to 1
filter(df, col1 == 1, col2 != 1)
Boolean operators can be used to combine multiple logical tests into a single test. These include & (and), | (or), and ! (not). Instead of using the & operator, you can also pass several logical tests to filter(), separated by commas.
# Exactly the same output
filter(df, a > 0 & b > 0)
filter(df, a > 0, b > 0)
To keep the observations for which the variable x is not NA:
filter(df, !is.na(x))
Don’t forget to use the double equal sign (==) when testing for equality; a single = does not compare values.
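The equivalence of & and comma-separated tests, and the way filter() treats NA, can be checked on a tiny made-up data frame. This is a sketch that assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical data with one missing value in column a
df <- data.frame(a = c(1, -2, 3, NA), b = c(5, 6, -7, 8))

both  <- filter(df, a > 0 & b > 0)
comma <- filter(df, a > 0, b > 0)
all.equal(both, comma)          # the two forms give the same rows

# Rows where the test evaluates to NA are dropped, just like FALSE rows
no_na <- filter(df, !is.na(a))
nrow(no_na)
```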
Adds new variables (columns) that are functions of existing variables.
# Adds a new column x that is a function of y and z
mutate(df, x = y - z)
# Adds a new column x that is a function of y and z
# and second new column a that is a function of b and c
mutate(df, x = y - z, a = b + c)
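A runnable sketch on mtcars, assuming dplyr is installed. The conversion factor is approximate and the new column names are made up for the example. It also shows that a column created in a mutate() call can be used later in the same call, and that the original data frame is left untouched.

```r
library(dplyr)

cars_metric <- mutate(mtcars,
                      kpl = mpg * 0.4251,        # approx. kilometres per litre
                      l_per_100km = 100 / kpl)   # uses kpl created just above

head(cars_metric[, c("mpg", "kpl", "l_per_100km")], 3)

# mtcars itself has not gained the new columns
"kpl" %in% names(mtcars)
```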
Reorders the rows by one or more variables.
Sorting is ascending by default; wrap a variable in desc() for descending order.
# first by var1 asc
# then by var2 desc,
# then by the sum of x and y
arrange(df, var1, desc(var2), x + y)
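The same pattern on mtcars, as a sketch that assumes dplyr is installed: sort by cyl ascending, then by mpg descending within each cyl value.

```r
library(dplyr)

sorted <- arrange(mtcars, cyl, desc(mpg))

# The first row is the most fuel-efficient 4-cylinder car
head(sorted[, c("cyl", "mpg")], 5)
```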
Condenses multiple values to a single value. Reduces each group to a single row by calculating aggregate measures.
summarise(df, min = min(x), avg = mean(y))
dplyr provides several helpful aggregate functions of its own, in addition to the ones that are already defined in R. These include:
| Function | Description |
|---|---|
| first(x) | The first element of vector x. |
| last(x) | The last element of vector x. |
| nth(x, n) | The nth element of vector x. |
| n() | The number of rows in the data.frame or group of observations that summarise() describes. |
| n_distinct(x) | The number of unique values in vector x. |
Next to these dplyr-specific functions, you can also turn a logical
test into an aggregating function with sum() or
mean(). A logical test returns a vector of
TRUE’s and FALSE’s. When you apply
sum() or mean() to such a vector, R coerces
each TRUE to a 1 and each FALSE to a 0.
sum() then represents the total number of observations that
passed the test; mean() represents the proportion.
mpg2 <- (mtcars$mpg > 20)
mpg2
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
## [25] FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE
# number and proportion of cars that get greater than 20 mpg
summarise(mtcars, tot = n(), num = sum(mpg2), avg = mean(mpg2))
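The aggregate helpers and the logical-test trick combine naturally with grouping. The sketch below assumes dplyr is installed; the summary column names are invented for the example.

```r
library(dplyr)

by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    n_cars   = n(),               # rows in each group
    n_gears  = n_distinct(gear),  # number of distinct gear values per group
    first_hp = first(hp),         # hp of the first row in the group
    over_20  = mean(mpg > 20)     # proportion of cars above 20 mpg
  )

by_cyl
```

Here over_20 is a proportion because mean() treats each TRUE as 1 and each FALSE as 0.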
Pipes require the magrittr or dplyr package.
- x %>% mean(na.rm=TRUE) passes x in as the first argument of the mean function
- Pipes can be chained together: z %>% is.na %>% sum
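A tiny check of both behaviours on made-up vectors, assuming the magrittr package is installed:

```r
library(magrittr)

x <- c(2, 4, NA, 6)                 # hypothetical values
piped  <- x %>% mean(na.rm = TRUE)  # x becomes the first argument of mean()
nested <- mean(x, na.rm = TRUE)

z <- c(1, NA, 3, NA)
n_missing <- z %>% is.na %>% sum    # same as sum(is.na(z))

piped
n_missing
```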
hflights %>%
mutate(diff = TaxiOut - TaxiIn) %>%
filter(!is.na(diff)) %>%
summarise(avg = mean(diff))
group_by() is most useful when followed by summarise(), which then returns one row per group.
mtcars %>%
group_by(cyl) %>%
summarise(avg_mpg = mean(mpg))
# Count the planes that flew to only one destination
hflights %>%
group_by(TailNum) %>%
summarise(num = n_distinct(Dest)) %>%
filter(num == 1) %>%
summarise(nplanes = n())
# Find the most visited destination for each carrier
hflights %>%
group_by(UniqueCarrier, Dest) %>%
summarise(n = n()) %>%
mutate(rank = rank(desc(n))) %>%
filter(rank == 1)
Use sample_n() and sample_frac() to display random samples from your data.
# display exactly 5 random rows
sample_n(mtcars, 5, replace = FALSE)
# display 5 percent of your data rows selected at random
sample_frac(mtcars, 0.05, replace = FALSE)
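In dplyr 1.0.0 and later, sample_n() and sample_frac() are superseded by slice_sample(), which covers both cases through its n and prop arguments. A sketch, assuming such a version of dplyr is installed; the seed is arbitrary and only makes the draw repeatable.

```r
library(dplyr)

set.seed(1)                                    # arbitrary seed
five_rows <- slice_sample(mtcars, n = 5)       # exactly 5 random rows
some_rows <- slice_sample(mtcars, prop = 0.25) # roughly a quarter of the rows

nrow(five_rows)
nrow(some_rows)
```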
Use as.data.table() to save data as a data table.
# library(data.table) (already loaded above)
hflights2 <- as.data.table(hflights)
library(ggplot2, quietly = TRUE)
ggplot2 is an R package that provides an alternative framework based upon Wilkinson’s (2005) Grammar of Graphics. ggplot2 is, in general, more flexible for creating “prettier” and more complex plots. ggplot2 has at least three ways of creating plots:
- qplot
- ggplot(...) + geom_XXX(...) + ...
- ggplot(...) + layer(...)
data(diamonds)
ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point()
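Parts of a ggplot2 statement:
| Part | Example |
|---|---|
| Data and aesthetics | ggplot(myDataFrame, aes(x=x, y=y)) |
| Layers | geom_point(), geom_histogram() |
| Facets | facet_wrap(~ cut), facet_grid(~ cut) |
| Scales | scale_y_log10() |
| Other options | ggtitle('my title'), ylim(c(0, 10000)), xlab('x-axis label') |
ls('package:ggplot2')[grep('geom_', ls('package:ggplot2'))]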
## [1] "geom_abline" "geom_area" "geom_bar"
## [4] "geom_bin_2d" "geom_bin2d" "geom_blank"
## [7] "geom_boxplot" "geom_col" "geom_contour"
## [10] "geom_contour_filled" "geom_count" "geom_crossbar"
## [13] "geom_curve" "geom_density" "geom_density_2d"
## [16] "geom_density_2d_filled" "geom_density2d" "geom_density2d_filled"
## [19] "geom_dotplot" "geom_errorbar" "geom_errorbarh"
## [22] "geom_freqpoly" "geom_function" "geom_hex"
## [25] "geom_histogram" "geom_hline" "geom_jitter"
## [28] "geom_label" "geom_line" "geom_linerange"
## [31] "geom_map" "geom_path" "geom_point"
## [34] "geom_pointrange" "geom_polygon" "geom_qq"
## [37] "geom_qq_line" "geom_quantile" "geom_raster"
## [40] "geom_rect" "geom_ribbon" "geom_rug"
## [43] "geom_segment" "geom_sf" "geom_sf_label"
## [46] "geom_sf_text" "geom_smooth" "geom_spoke"
## [49] "geom_step" "geom_text" "geom_tile"
## [52] "geom_violin" "geom_vline" "update_geom_defaults"
ggplot(legosets, aes(x=Pieces, y=USD_MSRP)) + geom_point()
ggplot(legosets, aes(x=Pieces, y=USD_MSRP, color=Availability)) + geom_point()
ggplot(legosets, aes(x=Pieces, y=USD_MSRP, size=Minifigures, color=Availability)) + geom_point()
ggplot(legosets, aes(x=Pieces, y=USD_MSRP, size=Minifigures)) + geom_point() + facet_wrap(~ Availability)
ggplot(legosets, aes(x='Lego', y=USD_MSRP)) + geom_boxplot()
ggplot(legosets, aes(x=Availability, y=USD_MSRP)) + geom_boxplot()
ggplot(legosets, aes(x=Availability, y=USD_MSRP)) + geom_boxplot() + coord_flip()
n <- 1e5
pop <- runif(n, 0, 1)
samp2 <- sample(pop, size=30)
mean(samp2)
## [1] 0.4011481
(samp2.se <- sd(samp2) / sqrt(length(samp2)))
## [1] 0.04725559
The approximate 95% confidence interval is then \(\bar{x} \pm 2 \times SE\), where \(\bar{x}\) is the sample mean.
(samp2.ci <- c(mean(samp2) - 2 * samp2.se, mean(samp2) + 2 * samp2.se))
## [1] 0.3066369 0.4956593
We are 95% confident that the true population mean is between 0.3066369 and 0.4956593.
That is, if we were to take 100 random samples and compute an interval from each one, we would expect about 95 of those intervals to contain the true population mean.
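The multiplier of 2 used above is a rounded value. As a rough sketch of how it compares with the normal and t multipliers, the code below draws a fresh sample (with an arbitrary seed, so the numbers will differ from those shown above) and builds all three intervals.

```r
set.seed(42)                          # arbitrary seed, for reproducibility only
pop  <- runif(1e5, 0, 1)
samp <- sample(pop, size = 30)
se   <- sd(samp) / sqrt(length(samp))

rough <- mean(samp) + c(-2, 2) * se                                # rule of thumb
z_ci  <- mean(samp) + c(-1, 1) * qnorm(0.975) * se                 # normal multiplier
t_ci  <- mean(samp) + c(-1, 1) * qt(0.975, df = length(samp) - 1) * se  # t multiplier

rbind(rough, z_ci, t_ci)
t.test(samp)$conf.int                 # matches t_ci
```

With a sample of 30 the three intervals are close, which is why the rule of thumb is usually good enough for a quick check.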
ci <- data.frame(mean=numeric(), min=numeric(), max=numeric())
for(i in seq_len(100)) {
samp <- sample(pop, size=30)
se <- sd(samp) / sqrt(length(samp))
ci[i,] <- c(mean(samp),
mean(samp) - 2 * se,
mean(samp) + 2 * se)
}
ci$sample <- 1:nrow(ci)
ci$sig <- ci$min < 0.5 & ci$max > 0.5
ggplot(ci, aes(x=min, xend=max, y=sample, yend=sample, color=sig)) +
geom_vline(xintercept=0.5) +
geom_segment() + xlab('CI') + ylab('') +
scale_color_manual(values=c('TRUE'='grey', 'FALSE'='red'))
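The plot shows which intervals miss the population mean. The proportion that cover it can also be computed directly. The sketch below is a separate simulation, not the loop above: it uses an arbitrary seed and 1000 samples instead of 100 so that the estimate is more stable.

```r
set.seed(2024)                        # arbitrary seed
pop <- runif(1e5, 0, 1)
mu  <- mean(pop)                      # population mean, close to 0.5

covers <- replicate(1000, {
  samp <- sample(pop, size = 30)
  se   <- sd(samp) / sqrt(length(samp))
  abs(mean(samp) - mu) <= 2 * se      # TRUE if the interval contains mu
})

mean(covers)                          # proportion of intervals containing mu
```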
Likert scales are a type of questionnaire where respondents are asked to rate items on scales usually ranging from four to seven levels (e.g. strongly disagree to strongly agree).
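Before turning to the likert package, it may help to see how a single Likert item can be stored in base R as an ordered factor. The responses and level labels below are invented for illustration.

```r
# Hypothetical five-level scale and responses
levels5   <- c("Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree")
responses <- c("Agree", "Neutral", "Strongly agree", "Agree", "Disagree")

item <- factor(responses, levels = levels5, ordered = TRUE)

counts <- table(item)                       # unused levels are kept, with count 0
counts
round(100 * prop.table(counts), 1)          # percentage per level

n_agree <- sum(item >= "Agree")             # ordered factors support comparisons
n_agree
```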
library(likert)
library(reshape)
data(pisaitems)
items24 <- pisaitems[,substr(names(pisaitems), 1,5) == 'ST24Q']
items24 <- rename(items24, c(
ST24Q01="I read only if I have to.",
ST24Q02="Reading is one of my favorite hobbies.",
ST24Q03="I like talking about books with other people.",
ST24Q04="I find it hard to finish books.",
ST24Q05="I feel happy if I receive a book as a present.",
ST24Q06="For me, reading is a waste of time.",
ST24Q07="I enjoy going to a bookstore or a library.",
ST24Q08="I read only to get information that I need.",
ST24Q09="I cannot sit still and read for more than a few minutes.",
ST24Q10="I like to express my opinions about books I have read.",
ST24Q11="I like to exchange books with my friends."))
likert R Package
l24 <- likert(items24)
summary(l24)
likert Plots
plot(l24)
plot(l24, type='heat')
plot(l24, type='density')
# Inputs
feature_names <- c("Feature 5", "Feature 4", "Feature 3", "Feature 2", "Feature 1")
num_features <- length(feature_names)
y <- array(c(10,4,1,0, 3,4,2,0, 1,2,8,1, 0,0,5,1, 1,2,5,3), dim=c(4,num_features))
# Calculate plot
num_neg_ratings <- 0
num_pos_ratings <- 0
for(i in 1:num_features) {
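# use the largest negative or positive total for both sides so the bars share a common centre line
num_neg_ratings <- max(num_neg_ratings, sum(y[1:2,i]), sum(y[3:4,i]))
num_pos_ratings <- max(num_pos_ratings, sum(y[1:2,i]), sum(y[3:4,i]))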
}
x <- array(0, dim=c(6, num_features))
for(i in 1:num_features) {
x[1, i] <- num_neg_ratings-sum(y[1:2, i])
x[2:5,i] <- y[1:4, i]
x[6, i] <- num_pos_ratings-sum(y[3:4, i])
}
# do the plot
png("/tmp/jbl.png", width=600, height=280)
colors <- c("white","#c91629","#ff5c76","#4e9cff","#0557d6","white")
par(mar=c(4.1,10.1,4.1,4.1))
barplot(x, main="Feature Valence", axes=FALSE,
col=colors, space=1.1, cex.axis=1.0, las=1, border=NA,
names.arg=feature_names, cex=1.0, horiz=TRUE)
axis(
side=1, # X axis
at=c(0, num_neg_ratings/2, num_neg_ratings, num_neg_ratings+(num_pos_ratings/2), num_neg_ratings+num_pos_ratings),
labels=c("Hate","Dislike",NA,"Like","Love")
)
dev.off()
## quartz_off_screen
## 2